PROBLEM TO BE SOLVED: To provide a voice synthesizing
method by which synthesized voices of superior tone
quality can be obtained, the size of a voice element
dictionary is compact, and the change in voice quality
is easily performed.

SOLUTION: In an analysis section 100, the voice elements
segmented by a pitch waveform segmenting section 101
are inputted into an LPC analysis section 102 and
expressed in the forms of residual signals and LPC
coefficients. A set of these spectrum parameters and the
residual signals is stored in a residual signal storage
section 103 and an LPC coefficient storage section 104
as a voice element dictionary. In a synthesis section
200, a selecting section 201 selects a set of the
residual signals and the spectrum parameters in
accordance with the phoneme symbol string given by a
sentence analysis/rhythm control section. Then, voice
elements are generated by passing the selected residual
signals through a synthesis filter 202, which is
configured from the selected spectrum parameters. Then,
a pitch period control by a pitch synchronization
waveform superimposing method and a duration length
control are conducted in a rhythm control section 203,
and synthesized voices are generated by connecting these
voice elements in an element connecting section 204.

COPYRIGHT: (C)1999,JPO

[Title of the Invention]

Voice Synthesis Method

[Abstract]
[Problems]

To provide a voice synthesis method in which the
tone quality of synthetic voice is excellent, the size
of a voice segment dictionary is compact, and the
change of voice quality is easy.
[Solution] 

In an analysis part 100, a voice segment cut out
in a pitch waveform cutting part 101 is inputted into
an LPC analysis part 102 and expressed in terms of a
residual signal and an LPC coefficient. A set of
spectral parameter and residual signal is stored in a
residual signal storage part 103 and an LPC coefficient
storage part 104 as a voice segment dictionary. In a
synthesis part 200, the set of residual signal and
spectral parameter is selected in a selection part 201
according to a phonemic symbol string given from a
sentence analysis and rhythm control part, the selected
residual signal is passed through a synthesis filter
202 conforming to the selected spectral parameter to
generate a voice segment, the pitch period and the
continuation time length for this voice segment are
controlled by a pitch synchronization waveform
convolution method in a rhythm control part 203, and
the voice segments are concatenated to generate a
synthetic voice.
[Claims]
[Claim 1] 

A voice synthesis method comprising expressing a 
voice segment in terms of a residual signal and a 
spectral parameter, passing the residual signal through 
a synthesis filter conforming to the spectral parameter 
to generate the voice segment, making a rhythm control 
for the voice segment, and concatenating the voice
segments after the rhythm control to generate a 
synthetic voice. 
[Claim 2] 

A voice synthesis method comprising expressing a 
voice segment in terms of a residual signal and a 
spectral parameter, storing a set of spectral parameter 
and residual signal as a voice segment dictionary, 
selecting the set of residual signal and spectral 
parameter according to a given phonemic symbol string, 
passing the selected residual signal through a 
synthesis filter conforming to the selected spectral 
parameter to generate a voice segment, making a rhythm 
control for the voice segment, and concatenating the 
voice segments after the rhythm control to generate a 
synthetic voice. 
[Claim 3] 

The voice synthesis method according to claim 1 or
2, wherein the pitch period is controlled by applying a
pitch synchronization waveform convolution method to
the voice segment obtained through the synthesis filter
in making the rhythm control.
[Claim 4] 

The voice synthesis method according to claim 3, 
wherein the continuation time length of the voice 
segment is further controlled in making the rhythm 
control. 

[Detailed Description of the Invention] 
[0001] 

[Field of the Invention] 

The present invention relates to a voice
synthesis method suitable for text-to-speech
synthesis. More particularly, the invention relates
to a voice synthesis method of generating a synthetic
voice from information on the phonemic symbol string,
the pitch, and the phoneme continuation time length.
[0002] 
[Prior Art] 

Artificially generating a voice signal from an
arbitrary sentence is called text-to-speech synthesis.
Text-to-speech synthesis is generally performed in three
stages by a language processing part, a phoneme
processing part, and a voice synthesis part. The
input text is first subjected to morphological
analysis and syntactic analysis in the language
processing part. Next, the accent and the intonation
are processed in the phoneme processing part, whereby
information on the phonemic symbol string, the pitch,
and the phoneme continuation time length is outputted.
Finally, a synthetic voice is generated from the
information on the phonemic symbol string, the pitch,
and the phoneme continuation time length in the voice
synthesis part.
[0003] 

A voice synthesis method usable for such
text-to-speech synthesis must be a method by which the voice
can be synthesized in an arbitrary rhythm for an 
arbitrary phonemic symbol string. The voice 
synthesis methods for enabling the arbitrary phonemic 
symbol string to be synthesized as the voice are 
divided roughly into an LPC analysis synthesis method 
and a waveform editing method. 
[0004] 

The LPC analysis synthesis method involves
applying the LPC analysis to a voice signal to acquire
an LPC spectral parameter and a residual signal and
making the rhythm control and concatenation at the
level of the residual signal, as introduced in
literature (1): Ito and Sato, "Pitch control method
for voice synthesis using the cut out residual", Sound
theory 2-7-18 (1989-3), for example. This method has
the advantage that the voice quality is easily changed
by manipulating the LPC coefficient, and that the size
of a voice segment dictionary for the synthesis is
comparatively small. However, the tone quality of the
synthetic voice is poor because the synthetic voice
sounds nasal and lacks distinctness.
[0005] 

On the other hand, the waveform editing method
synthesizes a voice by changing the pitch period or
continuation time length of voice segments cut out
from an actual voice waveform and concatenating the
voice segments, as introduced in
literature (2): Hirokawa, Hakoda and Sato, "A waveform
selection method for waveform editing synthesis in view
of spectral continuity", Sound theory 2-6-10 (1990-9),
literature (3): Iwata et al., "Japanese text voice
synthesis for personal computer software", Sound theory
2-8-13 (1993-10), and literature (4): Koyama and
Koizumi, "Review on waveform rule synthesis method
having a basic unit of VCV", Shingaku technical report,
SP96-8 (1996-5), for example. With this method, the
sound quality is relatively easy to enhance, and the
method has been actively studied.
[0006] 

In addition, from the standpoint that signal
processing for analysis and synthesis should be
avoided in order to enhance the sound quality, a method
has been proposed in which voice waveforms whose
phonemic environment and rhythm environment match are
concatenated in the longest possible units from a
database of natural voice (literature (5): N. Campbell
and A. W. Black, "CHATR, Arbitrary voice synthesis
system of natural voice waveform concatenation type",
Shingaku technical report, SP96-7 (1996-5)).
[0007] 

These methods have the advantage that a synthetic
voice of higher sound quality can be produced than with
the analysis synthesis method, but have the problem
that the size of the voice segment dictionary is
larger. There is also the problem that the voice
quality is difficult to change because the spectral
parameter is not expressed explicitly.
[0008] 

This invention has been achieved to solve the 
abovementioned problems with the prior art, and it is 
an object of the invention to provide a voice synthesis 
method in which the tone quality of synthetic voice is 
excellent, the size of a voice segment dictionary is 
compact, and the change of voice quality is easy. 
[0009] 

[Means for Solving the Problems] 

In order to achieve the above object, the
invention provides a voice synthesis method comprising
expressing a voice segment in terms of a residual
signal and a spectral parameter such as an LPC
coefficient, passing the residual signal through a
synthesis filter conforming to the spectral parameter
to generate a voice segment, making a rhythm control
for the voice segment, and concatenating the voice
segments after the rhythm control to generate a
synthetic voice.
[0010] 

More specifically, a voice synthesis method 
comprises expressing a voice segment in terms of a 
residual signal and a spectral parameter, storing a set 
of spectral parameter and residual signal as a voice 
segment dictionary, selecting the set of residual 
signal and spectral parameter according to a given 
phonemic symbol string, passing the selected residual 
signal through a synthesis filter conforming to the 
selected spectral parameter to generate a voice segment, 
making a rhythm control for this voice segment, and 
concatenating the voice segments after the rhythm 
control to generate a synthetic voice. 
[0011] 

It is preferred that the pitch period is 
controlled by applying a pitch synchronization waveform 
convolution method to the voice segment obtained 
through the synthesis filter in making the rhythm 
control. The continuation time length of the voice 
segment may be further controlled in making the rhythm 
control. 
[0012] 
With this voice synthesis method of the invention,
the rhythm control is performed at the level of the
voice segment, and the voice segments after the rhythm
control are concatenated. A synthetic voice of tone
quality equivalent to that of the waveform editing
method is thereby produced, whereas the conventional
residual-driven voice synthesis method performs the
rhythm control at the level of the residual signal.
[0013] 

In this case, if the pitch synchronization
waveform convolution method is employed for the control
of pitch period in the rhythm control, a distinct
synthetic voice of higher tone quality can be obtained. In
the invention, since the voice segment prepared as the 
voice segment dictionary is expressed in terms of the 
set of residual signal and spectral parameter such as 
LPC coefficient, the size of the voice segment 
dictionary is compact. 
[0014] 

In this way, since the voice segment is expressed 
in terms of the set of spectral parameter and residual 
signal, the voice quality of the synthetic voice can be 
easily changed by the operation of the spectral 
parameter. 
[0015] 

[Embodiments of the Invention] 
The preferred embodiments of the present invention
will be described below with reference to the
accompanying drawings. Figure 1 is a block diagram
showing a configuration of a text voice synthesis
system to which a voice synthesis method according to
an embodiment of the invention is applied. This voice
synthesis system is roughly composed of an analysis
part 100 and a synthesis part 200.
[0016] 

The analysis part 100 comprises a pitch waveform 
cutting part 101 for cutting out a pitch waveform 
from an input voice waveform, an LPC analysis part 
102 for making the LPC analysis (linear prediction
analysis) of the cut out pitch waveform to extract a
residual signal and an LPC coefficient that is a 
spectral parameter, and a residual signal storage 
part 103 and an LPC coefficient storage part 104 for 
storing a set of the residual signal and LPC 
coefficient extracted by the LPC analysis part 102 as 
a voice segment dictionary. 
[0017] 

On the other hand, the synthesis part 200
comprises a voice segment selection part 201 for
taking out a set of residual signal and LPC
coefficient corresponding to each phonemic
symbol from the residual signal storage part 103 and
the LPC coefficient storage part 104 in the analysis
part 100 according to a phonemic symbol string
obtained by analyzing a text used for text synthesis
in a sentence analysis and rhythm control part, not
shown, a synthesis filter 202 that is configured
according to the selected LPC coefficient and generates
a voice segment when the selected residual signal is
inputted, a rhythm control part 203 for making the
rhythm control for the generated voice segment
according to information on the pitch period and the
continuation time length given from the sentence
analysis and rhythm control part, and a segment
concatenation part 204 for concatenating the voice
segments after the rhythm control to generate a
synthetic voice.
[0018] 

Referring to a flowchart of Figure 2, a detailed 
processing procedure of the analysis part 100 will be 
described below. First of all, a voice waveform is
inputted into the analysis part 100 (step S11). This
voice waveform may be a representative voice segment 
generated as will be described later. 
[0019] 

Next, the pitch waveform cutting part 101 cuts
out a waveform of pitch period by multiplying the
input voice waveform by a window function of pitch
period length, and the LPC analysis part 102 makes
the pitch synchronization LPC analysis (steps S12 and
S13). In this case, since the discrete spectra of the
voice waveform are smoothed by the window function, a
spectral envelope that is less influenced by the
fundamental frequency is obtained.
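
The cutting step can be illustrated with a minimal Python sketch. The Hann window, the default width of two pitch periods, zero padding at the signal edges, and the names `cut_pitch_waveform`, `mark`, `period` and `width_in_periods` are all assumptions made for illustration; the description only states that a window function of pitch period length is applied.

```python
import numpy as np

def cut_pitch_waveform(signal, mark, period, width_in_periods=2):
    """Cut one pitch waveform around the sample index `mark`.

    Assumption: a Hann window spanning `width_in_periods` pitch periods,
    centered on the pitch mark; samples outside the signal count as zero.
    """
    signal = np.asarray(signal, dtype=float)
    half = (width_in_periods * period) // 2
    length = 2 * half + 1                      # odd length, so the center is exact
    window = np.hanning(length)
    out = np.zeros(length)
    start = mark - half
    lo = max(start, 0)
    hi = min(start + length, len(signal))
    out[lo - start:hi - start] = signal[lo:hi]  # copy the part inside the signal
    return out * window
```

A shorter window isolates a single pitch period more tightly, while a longer one gives a smoother spectral estimate; the factor of two used here is only a common choice, not a value taken from the description.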
[0020] 

As a result of the LPC analysis at step S13, the
voice segment is expressed in terms of the set of
residual signal and LPC coefficient in a pitch period
unit. The residual signal is stored in the residual
signal storage part 103, and the LPC coefficient is
stored in the LPC coefficient storage part 104, with
the residual signal and the LPC coefficient
associated with each other as a voice segment
dictionary (step S14).
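
As a rough sketch of how one pitch waveform can be decomposed into an LPC coefficient set and a residual signal, the following Python code uses the autocorrelation method with the Levinson-Durbin recursion and then inverse-filters the frame. The choice of the autocorrelation method, the sign convention A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p, and the name `lpc_analyze` are assumptions; the description does not specify how the LPC analysis is computed.

```python
import numpy as np

def lpc_analyze(frame, order):
    """Return (a, residual) for one frame, with a[0] == 1.

    Assumption: autocorrelation method solved by Levinson-Durbin.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values r[0] .. r[order].
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                  # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:                          # silent or degenerate frame
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                          # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    # Inverse filtering: e(n) = sum_k a[k] * x(n - k), zero initial state.
    residual = np.convolve(frame, a)[:n]
    return a, residual
```

The pair `(a, residual)` corresponds to what would be written to the LPC coefficient storage and the residual signal storage for one pitch period.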
[0021] 

Referring to a flowchart of Figure 3, a detailed
processing procedure of the synthesis part 200 will
be described below. In the voice synthesis, a
phonemic symbol string and information on the pitch
period and the continuation time length (phoneme
continuation time length) are given from the sentence
analysis and rhythm control part, not shown. First
of all, the set of residual signal and LPC
coefficient corresponding to each phonemic
symbol is selected and read in the selection part 201
from the residual signal storage part 103 and the LPC
coefficient storage part 104 composing the voice
segment dictionary according to the phonemic symbol
string (step S21).

[0022] 

Next, the synthesis filter 202 is configured 
conforming to the LPC coefficient selected at step 
S21, and the residual signal selected at step S21 is 
passed through the synthesis filter 202 to generate a 
voice segment (steps S22 and S23). 
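
A minimal Python sketch of steps S22 and S23 follows: an all-pole filter 1/A(z) is driven by the residual signal to regenerate the voice segment. The direct-form recursion, the zero initial state, the convention that `a[0]` equals 1, and the name `synthesis_filter` are assumptions for illustration only.

```python
import numpy as np

def synthesis_filter(residual, a):
    """Drive the all-pole filter 1/A(z) with `residual`.

    Assumption: A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p with a[0] == 1,
    and the filter starts from a zero state.
    """
    residual = np.asarray(residual, dtype=float)
    p = len(a) - 1
    out = np.zeros(len(residual))
    for n in range(len(residual)):
        acc = residual[n]
        for k in range(1, min(p, n) + 1):
            acc -= a[k] * out[n - k]            # feedback from past outputs
        out[n] = acc
    return out
```

If the residual was obtained by inverse filtering with the same coefficients and a zero initial state, this recursion reproduces the original pitch waveform up to rounding error.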
[0023] 

Next, the rhythm control part 203 makes the 
rhythm control, or the control for the pitch period 
and continuation time length, for the voice segment 
generated at step S23 according to information on the 
pitch period and the continuation time length given 
from the sentence analysis and rhythm control part. 
[0024]

Specifically, first of all, the pitch period is
controlled by applying a pitch synchronization
waveform overlap-add technique (PSOLA), as in a
waveform editing method, to the voice segment
generated at step S23 (step S24). The pitch
synchronization waveform overlap-add technique is
well known as described in literature (6): F.
Charpentier and M. Stella, "Diphone Synthesis Using
an Overlap-add Technique for Speech Waveforms
Concatenation", Proc. ICASSP 86, pp. 2015-2018 (1986),
for example. In this embodiment, to allow voice
synthesis of high tone quality, the pitch period is
controlled based on the pitch synchronization
waveform overlap-add technique in the following way.
[0025] 

Generally, the tone quality of synthetic voice
greatly depends on the smoothness of voiced sound.
Thus, in this embodiment, to make the change of pitch
period smoother, the given pitch period is
interpolated in sample units. Supposing the
central times of the j-th frame and the (j+1)-th
frame to be t_1 and t_2, and their pitch periods to be
p_1 and p_2, when the pitch period is linearly changed,
the pitch period p(t) at time t is represented by the
following expression.
[0026]

[Equation 1]

$$p(t) = \frac{(t - t_1)\,p_2 + (t_2 - t)\,p_1}{t_2 - t_1} \qquad (1)$$

Supposing the pitch mark positions from t_1 to t_2 to be
m_k (k = 1, 2, ..., N), the following expression holds.
[0027]

[Equation 2]

(2)

From the equations (1) and (2), the following
expression is obtained.
[0028]

[Equation 3]

$$m_k = m_{k-1} + (m_{k-1} + a)\,(e^{b} - 1) \qquad (3)$$

$$a = \frac{t_2 p_1 - t_1 p_2}{p_2 - p_1} \qquad (4)$$

$$b = \frac{p_2 - p_1}{t_2 - t_1} \qquad (5)$$
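
The step from equation (1) to equations (3) to (5) can be sketched as follows. The derivation assumes that equation (2) expresses the condition that exactly one pitch period elapses between successive pitch marks; this integral form is an assumption chosen because it is consistent with equations (1) and (3) to (5), and it presupposes that p_1 and p_2 differ.

```latex
% Equation (1) rewritten with the constants a and b of equations (4) and (5):
p(t) = \frac{(t - t_1)\,p_2 + (t_2 - t)\,p_1}{t_2 - t_1} = b\,(t + a)

% Assumed pitch-mark condition (one period between consecutive marks):
\int_{m_{k-1}}^{m_k} \frac{dt}{p(t)} = 1

% Evaluating the integral with p(t) = b (t + a):
\frac{1}{b}\,\ln\frac{m_k + a}{m_{k-1} + a} = 1
\;\Longrightarrow\;
m_k + a = (m_{k-1} + a)\,e^{b}
\;\Longrightarrow\;
m_k = m_{k-1} + (m_{k-1} + a)\,(e^{b} - 1)
```

When p_1 equals p_2 the constant a is undefined, and the recursion reduces to the uniform spacing m_k = m_{k-1} + p_1.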
[0029] 

The control of pitch period in the rhythm control
part 203 is made by superimposing (overlap-adding) the
voice segments generated through the synthesis filter
202 on the basis of the pitch mark positions acquired
in this way. That is, the top of a voice segment is
aligned at each pitch mark position on the time axis,
and the voice segments are superimposed on a zero
signal. In this case, the overlap portions of adjacent
voice segments corresponding to the pitch mark
positions are added, and the non-overlap portions
remain the original voice segments.
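
The pitch mark recursion of equations (3) to (5) and the superposition described above can be sketched in Python as follows. The function names `pitch_marks` and `overlap_add`, the rounding of marks to integer sample positions, the alignment of the first sample of each segment with its pitch mark, and the handling of the equal-period case are assumptions for illustration.

```python
import math
import numpy as np

def next_mark(m, t1, t2, p1, p2):
    """One step of equation (3); falls back to a constant period if p1 == p2."""
    if p1 == p2:
        return m + p1
    a = (t2 * p1 - t1 * p2) / (p2 - p1)         # equation (4)
    b = (p2 - p1) / (t2 - t1)                   # equation (5)
    return m + (m + a) * (math.exp(b) - 1.0)

def pitch_marks(t1, t2, p1, p2):
    """Pitch marks from t1 up to t2, assuming positive periods p1 and p2."""
    marks = [float(t1)]
    while True:
        nxt = next_mark(marks[-1], t1, t2, p1, p2)
        if nxt > t2:
            break
        marks.append(nxt)
    return marks

def overlap_add(segments, marks, length):
    """Superimpose segments on a zero signal, one per pitch mark."""
    out = np.zeros(length)                      # the zero signal
    for seg, m in zip(segments, marks):
        start = int(round(m))
        end = min(start + len(seg), length)
        if start < length:
            out[start:end] += np.asarray(seg, dtype=float)[:end - start]
    return out
```

Where two placed segments overlap their samples are summed, and elsewhere the output is simply the segment itself, which matches the behavior described for the overlap and non-overlap portions.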
[0030] 

In the rhythm control part 203, the control of
the continuation time length is further made (step
S25). In the control of the continuation time length,
it is important how the pitch marks of the original
voice waveform and the synthetic voice waveform are
associated. In this embodiment, the temporal mapping
is performed using a function for associating them.
With this method, by defining the mapping function
appropriately, the thinning and interpolation of
pitch waveforms can be arbitrarily controlled
according to the property of voice segments to be
concatenated.
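
One way to picture the temporal mapping is the Python sketch below, which decides, for each pitch mark of the synthetic waveform, which pitch waveform of the original segment to place there. The normalized positions, the default linear mapping, the nearest-index rounding, and the name `map_pitch_waveforms` are assumptions; the description leaves the form of the mapping function open.

```python
import math

def map_pitch_waveforms(n_source, n_target, mapping=None):
    """Source pitch waveform index for each target pitch mark.

    Assumption: `mapping` takes a normalized target position in [0, 1]
    and returns a normalized source position in [0, 1]; the default is
    a linear (identity) mapping.
    """
    if mapping is None:
        mapping = lambda x: x
    indices = []
    for k in range(n_target):
        x = k / (n_target - 1) if n_target > 1 else 0.0
        y = min(max(mapping(x), 0.0), 1.0)
        # Round half up to the nearest source index.
        indices.append(int(math.floor(y * (n_source - 1) + 0.5)))
    return indices
```

With fewer target marks than source waveforms some waveforms are skipped (thinning), and with more target marks some are repeated; a non-linear `mapping` could, for example, keep a consonant part intact and stretch only the vowel part.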

[0031] 

Next, the voice segments for which the rhythm
control (control of pitch period and continuation
time length) is made in the above way are concatenated
(step S26). In this embodiment, to reduce a distortion
caused by discontinuity of waveform at the concatenated
part, the CV and VC segments are employed as the voice
segments, whereby the voice segments are concatenated
at the vowel steady part. At this time, the pitch
waveforms of the vowels to be concatenated are added
with weights over the entire vowel section. In this
way, a synthetic voice in which an arbitrary sentence
(text) is converted into a voice signal is produced.
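
The weighted addition over the vowel section can be sketched as a cross-fade between the vowel pitch waveforms of the preceding CV segment and those of the following VC segment. The linear weights, the requirement that both sides supply the same number of equally long pitch waveforms, and the name `crossfade_vowel` are assumptions; the description only states that the waveforms are added with weight over the entire vowel section.

```python
import numpy as np

def crossfade_vowel(cv_waveforms, vc_waveforms):
    """Blend two equally shaped arrays of pitch waveforms (rows = pitch periods).

    Assumption: the weight of the CV side falls linearly from 1 to 0
    across the vowel while the VC side rises from 0 to 1.
    """
    cv = np.asarray(cv_waveforms, dtype=float)
    vc = np.asarray(vc_waveforms, dtype=float)
    if cv.shape != vc.shape:
        raise ValueError("both sides must have the same shape")
    n = cv.shape[0]
    w = np.linspace(1.0, 0.0, n)[:, None]       # one weight per pitch period
    return w * cv + (1.0 - w) * vc
```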
[0032] 

Next, a learning method of voice segments
according to the invention will be described below.
Conventionally, the generation of voice segments relied
on manual trial and error, and a skilled researcher had
to repeat, over a long time, a series of operations of
cutting out voice segments from voice data uttered as
single sounds, meaningless words, or continuous words
and evaluating the synthetic voice.
[0033] 
On the other hand, a well-known method of
automatically generating the voice segments from a
voice database is the phoneme environment clustering
(COC: Context Oriented Clustering) method, as disclosed in
literature (7): Nakajima and Hamada, "Rule synthesis
method with clustering based on recent sound state",
Shingaku theory, D-II, vol. J-72-D-II, No. 8,
pp. 1177-1179 (1989-8), for example. This method comprises
clustering the voice segments cut out from the voice
database based on the variance of the spectral
parameter under the constraint of phoneme environment
and making the centroid of each cluster a
representative voice segment.
[0034] 

This phoneme environment clustering method has a
feature that the representative voice segment can be
determined based on a statistical evaluation criterion
without relying on prior knowledge, but it does not
consider the distortion caused by the control of pitch
period, which is problematic in voice synthesis,
so that the tone quality of the synthetic voice is not
necessarily sufficient.
[0035] 

Thus, a learning method of representative voice
segment will be described below in which the distortion
of synthetic voice is defined, including the distortion
caused by making the rhythm control (control of pitch
period and continuation time length), and the distortion
is minimized.

[0036] 

Figure 4 is a block diagram showing a closed loop
learning system for the representative voice segment
according to this embodiment. Though this learning
method can be practically applied to various
synthesizers or synthetic units, an instance where the
learning method is applied to learning the CV and VC
voice segments for use in the voice synthesis system
will be described here. This method comprises
obtaining the LPC coefficient and the residual signal
for the synthesis filter after generating the voice
segment by learning.
[0037] 

In learning, first of all, as the preparation, a
large number of voice segments in a voice synthesis
unit are cut out from the voice database 401 and made
the representative voice segment candidates 402. At
the same time, the training data 403 to be learned is
generated in the same way. Next, the pitch period and
the continuation time length of the representative
voice segment candidate are analyzed (404) and then
changed (405), with the training data 403 as the
target, to synthesize the voice segments. In this way,
voice segments are generated for all the combinations
of the representative voice segment candidates 402 and
the training data.
[0038] 

Next, the distortion of each generated voice segment
from the training data is calculated and evaluated
(406), and the representative voice segment for which
the total sum of distortions over all the training
data is minimized is searched for and selected from
among the representative voice segment candidates (407).
This selected representative voice segment candidate
is made the representative segment.
[0039] 

This learning method is called a closed loop 
learning in the sense of feeding back the evaluation 
result of synthesized voice segments to the learning 
of voice segment. In the following, a distortion 
scale and a selection method of the representative 
voice segment, which are important in this learning 
method, will be described using a specific example. 
[0040] 

(Distortion scale)

The distortion scale of learning is required to
reflect the result of subjective evaluation
effectively. Since the power of synthetic voice is
controlled in a voice synthesis system, it is
necessary that the representative voice segment is
evaluated at the level where the power is normalized.
In view of this, the distortion of synthetic voice
segment is defined by the following expression.
[0041] 

[Equation 4]

$$e_{ij} = \sum_{n} \left( r'_j(n) - s'_{ij}(n) \right)^2 \qquad (6)$$

$$r'_j(n) = \frac{r_j(n)}{\left( \sum_{k} r_j(k)^2 \right)^{1/2}} \qquad (7)$$

$$s'_{ij}(n) = \frac{s_{ij}(n)}{\left( \sum_{k} s_{ij}(k)^2 \right)^{1/2}} \qquad (8)$$

[0042]

Where r_j designates the training data, and s_ij
designates the synthetic voice segment for the
representative voice segment candidate u_i with r_j as
the target.
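
The power-normalized distortion of equations (6) to (8) can be sketched in Python as follows. The name `normalized_distortion` and the requirements that both waveforms have the same length and non-zero energy are assumptions made for this illustration.

```python
import numpy as np

def normalized_distortion(r, s):
    """Squared error between two waveforms after scaling each to unit energy.

    Assumption: `r` (training data) and `s` (synthetic voice segment)
    have equal length and neither is all zeros.
    """
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    if r.shape != s.shape:
        raise ValueError("waveforms must have the same length")
    r_norm = r / np.sqrt(np.sum(r ** 2))        # equation (7)
    s_norm = s / np.sqrt(np.sum(s ** 2))        # equation (8)
    return float(np.sum((r_norm - s_norm) ** 2))  # equation (6)
```

Because both waveforms are scaled to unit energy, a synthetic segment that differs from the target only by a positive gain yields zero distortion, which reflects the remark that power is controlled separately in the synthesis system.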

(Selection of representative voice segment) 
Supposing that the number of representative voice 
segments per synthetic unit is n and the number of 
representative voice segment candidates is N, the 
selection of representative voice segment is a 
problem of searching one set of representative voice 
segments minimizing the following cost function for 
all the combinations of choosing n from N candidates. 
[0043] 

[Equation 5]

$$E = \frac{1}{M} \sum_{j=1}^{M} \min\left( e_{i_1 j}, \ldots, e_{i_n j} \right) \qquad (9)$$

[0044]
Where M is the number of training data and i_1, ..., i_n
are the indices of the selected candidates. If the
set of representative voice segments minimizing the
cost function of formula (9) is obtained, all the
training data can be clustered into clusters
corresponding to the representative voice segments.
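
The selection can be sketched as an exhaustive search over all ways of choosing n candidates out of N, using a precomputed distortion table. The table layout `e[i][j]` (candidate i against training item j), the name `select_representatives`, and the tie-breaking by first occurrence are assumptions for illustration.

```python
from itertools import combinations

def select_representatives(e, n):
    """Pick the n candidates minimizing the mean of per-item minimum distortions.

    Assumption: e[i][j] is the distortion of candidate i for training item j.
    Returns (selected indices, cost, clusters of training item indices).
    """
    n_candidates = len(e)
    m = len(e[0])
    best_cost, best_set = None, None
    for subset in combinations(range(n_candidates), n):
        cost = sum(min(e[i][j] for i in subset) for j in range(m)) / m
        if best_cost is None or cost < best_cost:
            best_cost, best_set = cost, subset
    # Assign every training item to its least-distorting selected candidate.
    clusters = {i: [] for i in best_set}
    for j in range(m):
        nearest = min(best_set, key=lambda i: e[i][j])
        clusters[nearest].append(j)
    return best_set, best_cost, clusters
```

The number of subsets grows combinatorially with N and n, so this brute-force form is practical only for small candidate sets; it is meant to make the cost function concrete rather than to suggest how the search was actually carried out.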
[0045] 

Figure 5 shows an example of selecting two
representative voice segments from four
representative voice segment candidates. In this
example, the cost function of the combination of u_2 and
u_3 is the smallest among all pairs chosen from u_1 to u_4.
As a result, u_2 and u_3 are selected as the
representative voice segments.
[0046] 

(Evaluation experiment)

An experiment was conducted for generating one
representative voice segment for each synthetic unit
by the above method, with the diphone of CV and VC as
the synthetic unit. By inspection, the voice segment
data and the representative voice segment candidates
used for training were cut out from the voice
database with phoneme labels attached, whereby a
total of 302 CV and VC representative voice segments
were generated by the closed loop learning method.
The learning took about 1.5 hours on a Sun Ultra2.
[0047] 
Figure 6 shows the value of the cost function
when the number of voice segments per synthetic unit
(CV, VC) increases. From this graph, it can be
seen that the distortion of synthetic voice
decreases monotonically as the number of voice
segments increases.
[0048] 

Conventionally, it is well known that the tone
quality of synthetic sound is improved by using
different voice segments according to the power or pitch.
However, with the conventional trial and error method, 
it took a lot of labor and time to generate the 
representative voice segments, and it was not easy to 
increase the number of representative voice segments. 
[0049] 

In contrast, with the above closed loop
learning method, if the labeled voice data is given, 
the voice segment is automatically generated in a short 
time, whereby it is easy to generate any number of 
representative voice segments. And the selection of 
voice segment is not made according to the prior 
knowledge such as power or pitch, but a selection rule 
can be created by the distortion scale of synthetic 
voice. That is, the selection rule of voice segment 
can be created by clustering the training data into 
clusters of selected representative voice segments and 
extracting a common factor within the cluster. 
[0050]

Next, the tone quality of synthetic voice obtained
in the above voice synthesis system was evaluated. The
generated representative voice segment was inputted as
the voice input of Figure 1 to the analysis part,
decomposed by the pitch waveform cutting part 101 and
the LPC analysis part 102, and stored in terms of the
residual signal and the LPC coefficient as the voice
segment dictionary in the residual signal storage
part 103 and the LPC coefficient storage part 104.
When stored, the residual signal and the LPC
coefficient were encoded by applying a vector-scalar
quantization technique. As a result, the data amount
was as small as about 150 kbytes per speaker, which
was one-tenth to one-twentieth of that of the
waveform editing method. Accordingly, the voice
synthesis system of this embodiment is easily
incorporated into a portable information terminal
such as a PDA or a car navigation system.
[0051] 

The subjective evaluation was made on a seven-point
scale (-3: very bad to +3: very good) by general
subjects, a total of ten persons (equal numbers of men
and women) including seven university students. As
a result, the tone quality of synthetic voice
obtained by the voice synthesis system of this
embodiment was better by 2.5 points on average,
over male and female speakers and various sentences, than
that of the voice synthesis system of the conventional
cepstrum synthesis method. The subjects evaluated
that the distinctness was greatly improved,
and that the tone quality was soft and close to natural
voice.
[0052] 

[Advantages of the Invention] 

As described above, with the voice synthesis 
method of the invention, since the voice segment is 
expressed in terms of a set of residual signal and 
spectral parameter such as LPC coefficient, and the
rhythm control is made for the voice segment generated 
by the residual signal and the spectral parameter, the 
distinct synthetic voice of high tone quality can be 
generated, the change of voice quality is easily made 
by the operation of spectral parameter, and the size of 
the voice segment dictionary is made compact. 
[Brief Description of the Drawings] 
[Figure 1] 

Figure 1 is a block diagram showing a
configuration of a voice synthesis system according to
one embodiment of the present invention.
[Figure 2] 

Figure 2 is a flowchart showing a processing 
procedure of the analysis side in this embodiment. 
[Figure 3] 
Figure 3 is a flowchart showing a processing
procedure of the synthesis side in this embodiment.
[Figure 4] 

Figure 4 is a block diagram for explaining a
closed loop learning system of representative voice
segment.
[Figure 5] 

Figure 5 is a diagram showing a selection example 
of representative voice segment based on a distortion 
of synthetic voice segment. 
[Figure 6] 

Figure 6 is a diagram showing the relationship 
between the number of representative voice segments and 
the cost function. 
[Description of Symbols] 

100 ... voice analysis part
101 ... pitch waveform cutting part
102 ... LPC analysis part
103 ... residual signal storage part
104 ... LPC coefficient storage part
200 ... voice synthesis part
201 ... selection part
202 ... LPC synthesis filter
203 ... rhythm control part
204 ... voice segment concatenation part
Figure 1

101 Pitch waveform cutting part 

102 LPC analysis part 

103 Residual signal part 

104 LPC coefficient 

201 Selection 

202 Synthesis filter 

203 Rhythm control 

204 Concatenation of voice segments 
#1 Input voice 

#2 Phonemic symbol string 

#3 Synthesis 

#4 Pitch period and continuation time length 

#5 Synthetic voice 

#6 Analysis 

Figure 2 

S11 Input voice.

S12 Cut out pitch waveform.

S13 Pitch synchronization LPC analysis.

S14 Store a set of residual signal and LPC coefficient
as voice segment dictionary.

#1 Start 
#2 End 

Figure 3 
S21 Select a set of residual signal and LPC
coefficient from voice segment dictionary according to
phonemic symbol string.

S22 Configure a synthesis filter based on LPC
coefficient.

S23 Generate voice segment by passing residual signal
through synthesis filter.

S24 Control pitch period by pitch synchronization
waveform convolution method.

S25 Control continuation time length.

S26 Output synthetic voice by concatenating voice
segments.

#1 Start 
#2 End 

Figure 4 

401 Voice database 

402 Representative voice segment candidate 

403 Training data 

404 Analysis of pitch continuation time length 

405 Change of pitch continuation time length 

406 Distortion evaluation 

407 Minimum distortion search 

408 Representative voice segment 

Figure 5 

#1 Representative segment candidate 
#2 Training data
Figure 6 

#1 Value of cost function 
#2 Number of voice segments 